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Abstract 

We consider rules for discarding predictors in lasso regression and 
related problems, for computational efficiency. El Ghaoui et al. (20f0) 
propose "SAFE" rules, based on univariate inner products between each 
predictor and the outcome, that guarantee a coefficient will be zero in 
the solution vector. This provides a reduction in the number of variables 
that need to be entered into the optimization. In this paper, we propose 
strong rules that are not foolproof but rarely fail in practice. These are 
very simple, and can be complemented with simple checks of the Karush- 
Kuhn- Tucker (KKT) conditions to ensure that the exact solution to the 
convex problem is delivered. These rules offer a substantial savings in both 
computational time and memory, for a variety of statistical optimization 
problems. 



1 Introduction 

Our focus here is on statistical models fit using l\ regularization. We start 
with penalized linear regression. Consider a problem with N observations and 
p predictors, and let y denote the iV-vector of outcomes, and X be the N x p 
matrix of predictors, with jth column Xj and ith row Xi. For a set of indices 
A = {ji , . . . jk}, we write to denote the N xk submatrix X_4 = [x^ , . . . Xj fc ] , 
and also b_4 = (bj 1 , . . . bj k ) for a vector b. We assume that the predictors and 
outcome are centered, so we can omit an intercept term from the model. 
The lasso Tibshirani (1996) solves the optimization problem 

/3 = argmin i||y - X/3||| + \\\f3\\ u (1) 
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where A > is a tuning parameter. There has been considerable work in the 
past few years deriving fast algorithms for this problem, especially for large 
values of N and p. A main reason for using the lasso is that the l\ penalty 
tends to give exact zeros in (3, and therefore it performs a kind of variable 
selection. Now suppose we knew, a priori to solving (1), that a subset of the 
variables S C {1, . . .p} will have zero coefficients in the solution, that is, 0s = 0. 
Then we could solve problem (1) with the design matrix replaced by Xgc, where 
S° — {1, . . .p} \ S, for the remaining coefficients 0s<=- If S is relatively large, 
then this could result in a substantial computational savings. 

El Ghaoui et al. (2010) construct such a set S of "screened" or "discarded" 
variables by looking at the inner products |xjy|, j = l,...p. The authors 
use a clever argument to derive a surprising set of rules called "SAFE" , and 
show that applying these rules can reduce both time and memory in the overall 
computation. In a related work, Wu et al. (2009) study l\ penalized logistic 
regression and build a screened set S based on similar inner products. However, 
their construction does not guarantee that the variables in S actually have zero 
coefficients in the solution, and so after fitting on Xgc, the authors check the 
Karush-Kuhn- Tucker (KKT) optimality conditions for violations. In the case 
of violations, they weaken their set S, and repeat this process. Also, Fan & 
Lv (2008) study the screening of variables based on their inner products in the 
lasso and related problems, but not from a optimization point of view. Their 
screening rules may again set coefficients to zero that are nonzero in the solution, 
however, the authors argue that under certain situations this can lead to better 
performance in terms of estimation risk. 

In this paper, we propose strong rules for discarding predictors in the lasso 
and other problems that involve lasso-type penalties. These rules discard many 
more variables than the SAFE rules, but are not foolproof, because they can 
sometimes exclude variables from the model that have nonzero coefficients in 
the solution. Therefore we rely on KKT conditions to ensure that we are indeed 
computing the correct coefficients in the end. Our method is most effective 
for solving problems over a grid of A values, because we can apply our strong 
rules sequentially down the path, which results in a considerable reduction in 
computational time. Generally speaking, the power of the proposed rules stems 
from the fact that: 

• the set of discarded variables S tends to be large and violations rarely 
occur in practice, and 

• the rules are very simple and can be applied to many different problems, 
including the elastic net, lasso penalized logistic regression, and the graph- 
ical lasso. 

In fact, the violations of the proposed rules are so rare, that for a while a group 
of us were trying to establish that they were foolproof. At the same time, others 
in our group were looking for counter-examples [hence the large number of co- 
authors!]. After many flawed proofs, we finally found some counter-examples to 
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the strong sequential bound (although not to the basic global bound). Despite 
this, the strong sequential bound turns out to be extremely useful in practice. 

Here is the layout of this paper. In Section 2 we review the SAFE rules of El 
Ghaoui et al. (2010) for the lasso. The strong rules are introduced and illustrated 
in Section 3 for this same problem. We demonstrate that the strong rules rarely 
make mistakes in practice, especially when p >• N. In Section 4 we give a 
condition under which the strong rules do not erroneously discard predictors 
(and hence the KKT conditions do not need to be checked). We discuss the 
elastic net and penalized logistic regression in Sections 5 and 6. Strong rules for 
more general convex optimization problems are given in Section 7, and these are 
applied to the graphical lasso. In Section 8 we discuss how the strong sequential 
rule can be used to speed up the solution of convex optimization problems, 
while still delivering the exact answer. We also cover implementation details of 
the strong sequential rule in our glmnet algorithm (coordinate descent for lasso 
penalized generalized linear models). Section 9 contains some final discussion. 



2 Review of the SAFE rules 

The basic SAFE rule of El Ghaoui et al. (2010) for the lasso is defined as follows: 
fitting at A, we discard predictor j if 

|xJy|<A-||x,|| 2 ||y|| 2 ^^, (2) 

^max 

where A max = max^ |xf y| is the smallest A for which all coefficients are zero. 
The authors derive this bound by looking at a dual of the lasso problem (1). 
This is: 

8= a rgm ax G(0) = hy\\j-hy + 9\\t (3) 
e ll 

subject to |xj#| < A for j = 1, . . .p. 
The relationship between the primal and dual solutions is 6 = X/3 — y, and 

( {+A} if fa > 
xj0 e < {-A} if ft < (4) 
[[-A, A] if&=0 

for each j — 1 , ... p. Here is a sketch of the argument: first we find a dual feasible 
point of the form 6q = sy, (s is a scalar), and hence 7 = G(sy) represents a 
lower bound for the value of G at the solution. Therefore we can add the 
constraint G{0) > 7 to the dual problem (3) and nothing will be changed. For 
each predictor j, we then find 

rrij = argmax |xj<?| subject to G{9) > 7. 
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If nij < A (note the strict inequality), then certainly at the solution |xj#| < A, 
which implies that j3j = by (4). Finally, noting that s = A/A max produces a 
dual feasible point and rewriting the condition mj < A gives the rule (2). 

In addition to the basic SAFE bound, the authors also derive a more compli- 
cated but somewhat better bound that they call "recursive SAFE" (RECSAFE). 
As we will show, the SAFE rules have the advantage that they will never discard 
a predictor when its coefficient is truly nonzero. However, they discard far fewer 
predictors than the strong sequential rule, introduced in the next section. 

3 Strong screening rules 

3.1 Basic and strong sequential rules 

Our basic (or global) strong rule for the lasso problem (1) discards predictor j 
if 

|xjy| < 2A - A max , (5) 

where as before A max = max^ |xjy|. 

When the predictors are standardized (||xj H2 = 1 for each j), it is not difficult 
to see that the right hand side of (2) is always smaller than the right hand side 
of (5), so that in this case the SAFE rule is always weaker than the basic strong 
rule. This follows since A max < ||y||2, so that 

A-!|y||2— p — - < A - (A max - A) = 2A-A max . 

Figure 1 illustrates the SAFE and basic strong rules in an example. 

When the predictors are not standardized, the ordering between the two 
bounds is not as clear, but the strong rule still tends to discard more variables 
in practice unless the predictors have wildly different marginal variances. 

While (5) is somewhat useful, its sequential version is much more powerful. 
Suppose that we have already computed the solution /3(Ao) at Ao, and wish to 
discard predictors for a fit at A < Ao. Defining the residual r = y X/3(Ao), 
our strong sequential rule discards predictor j if 

|xjr| < 2A - A . (6) 

Before giving a detailed motivation for these rules, we first demonstrate their 
utility. Figure 2 shows some examples of the applications of the SAFE and 
strong rules. There are four scenarios with various values of N and p; in the 
first three panels, the X matrix is dense, while it is sparse in the bottom right 
panel. The population correlation among the feature is zero, positive, negative 
and zero in the four panels. Finally, 25% of the coefficients are non-zero, with 
a standard Gaussian distribution. In the plots, we are fitting along a path 
of decreasing A values and the plots show the number of predictors left after 
screening at each stage. We see that the SAFE and RECSAFE rules only exclude 
predictors near the beginning of the path. The strong rules are more effective: 
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Figure 1: SAFE and basic strong bounds in an example with 10 predictors, 
labelled at the right. The plot shows the inner product of each predictor with the 
current residual, with the predictors in the model having maximal inner product 
equal to ±A. The dotted vertical line is drawn at X max ; the broken vertical line 
is drawn at A. The strong rule keeps only predictor #3, while the SAFE bound 
keeps predictors #8 and #1 as well. 
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remarkably, the strong sequential rule discarded almost all of the predictors that 
have coefficients of zero. There were no violations of any of rules in any of the 
four scenarios. 

It is common practice to standardize the predictors before applying the lasso, 
so that the penalty term makes sense. This is what was done in the examples 
of Figure 2. But in some instances, one might not want to standardize the 
predictors, and so in Figure 3 we investigate the performance of the rules in this 
case. In the left panel the population variance of each predictor is the same; in 
the right panel it varies by a factor of 50. We see that in the latter case the 
SAFE rules outperform the basic strong rule, but the sequential strong rule is 
still the clear winner. There were no violations in any of rules in either panel. 



3.2 Motivation for the strong rules 



We now give some motivation for the strong rule (5) and later, the sequential 
rule (6). We start with the KKT conditions for the lasso problem (1). These 
are 

xJ(y-X/3) = A- Sj (7) 
for j = l,...p, where Sj is a subgradient of Bf 



8j e < 



{+1} 
{-1} 



if f3, > 
if Bi < 



(8) 



I [-1,1] if Pj 



0. 



Let Cj(X) — xj{y— X/3(A)}, where we emphasize the dependence on A. Suppose 
in general that we could assume 



|c'(A)| < 1, 



(9) 



where c'j is the derivative with respect to A, and we ignore possible points of 
non-differentiability. This would allow us to conclude that 



|cj(A max ) -Cj(A)| 



< 



< A n 



4(A) dX 

\c'j(X)\dX 
A, 



(10) 
(11) 



and so 

MA max )| < 2A - A max => \cj(X)\ < X => Pj (A) = 0, 

the last implication following from the KKT conditions, (7) and (8). Then the 
strong rule (5) follows as /3(A max ) = 0, so that |Cj(A max )| = |xjyj. 

Where does the slope condition (9) come from? The product rule applied to 
(7) gives 

c' j (X) = s j (X) + X-s' j (X), (12) 
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Figure 2: Lasso regression: results of different rules applied to four different 
scenarios. There are four scenarios with, various values of N and p; in the first 
three panels the X matrix is dense, while it is sparse in the bottom right panel. 
The population correlation among the feature is zero, positive, negative and zero 
in the four panels. Finally, 25% of the coefficients are non-zero, with a standard 
Gaussian distribution. In the plots, we are fitting along a path of decreasing A 
values and the plots show the number of predictors left after screening at each 
stage. A broken line with unit slope is added for reference. The proportion of 
variance explained by the model is shown along the top of the plot. There were 
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Figure 3: Lasso regression: results of different rules when the predictors are not 
standardized. The scenario in the left panel is the same as in the top left panel 
of Figure 2, except that the features are not standardized before fitting the lasso. 
In the data generation for the right panel, each feature is scaled by a random 
factor between 1 and 50, and again, no standardization is done. 
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and as |sj(A)| < 1, condition (9) can be obtained if we simply drop the sec- 
ond term above. For an active variable, that is j3j(X) 7^ 0, we have Sj(X) = 
sign{/3j(A)}, and continuity of $j(X) with respect to A implies s'j(X) = 0. But 
s'j{\) ^ for inactive variables, and hence the bound (9) can fail, which makes 
the strong rule (5) imperfect. It is from this point of view — writing out the 
KKT conditions, taking a derivative with respect to A, and dropping a term — 
that we derive strong rules for l\ penalized logistic regression and more general 
problems. 

In the lasso case, condition (9) has a more concrete interpretation. From 
Efron et al. (2004), we know that each coordinate of the solution f3j(\) is a 
piecewise linear function of A, hence so is each inner product Cj(X). Therefore 
Cj (A) is diffcrentiable at any A that is not a kink, the points at which variables 
enter or leave the model. In between kinks, condition (9) is really just a bound 
on the slope of Cj(X). The idea is that if we assume the absolute slope of Cj(X) is 
at most 1, then we can bound the amount that Cj(X) changes as we move from 
A mM to a value A. Hence if the initial inner product Cj(A max ) starts too far 
from the maximal achieved inner product, then it cannot "catch up" in time. 
An illustration is given in Figure 4. 

The argument for the strong bound (intuitively, an argument about slopes), 
uses only local information and so it can be applied to solving (1) on a grid of A 
values. Hence by the same argument as before, the slope assumption (9) leads 
to the strong sequential rule (6). 

It is interesting to note that 

|xjr| < A (13) 

is just the KKT condition for excluding a variable in the solution at A. The 
strong sequential bound is A — (Ao — A) and we can think of the extra term Ao — A 
as a buffer to account for the fact that |xjr| may increase as we move from A 
to A. Note also that as A — >■ A, the strong sequential rule becomes the KKT 
condition (13), so that in effect the sequential rule at A "anticipates" the KKT 
conditions at A. 

In summary, it turns out that the key slope condition (9) very often holds, 
but can be violated for short stretches, especially when p ss N and for small 
values of A in the "overfit" regime of a lasso problem. In the next section we 
provide an example that shows a violation of the slope bound (9), which breaks 
the strong sequential rule (6). We also give a condition on the design matrix 
X under which the bound (9) is guaranteed to hold. However in simulations 
in that section, we find that these violations are rare in practice and virtually 
non-existent when p » N. 
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Figure 4: Illustration of the slope bound (9) leading to the strong rule (6). The 
inner product Cj is plotted in red as a function of X, restricted to only one 
predictor for simplicity. The slope of Cj between A max and Ai is bounded in 
absolute value by 1, so the most it can rise over this interval is A max — \\ . 
Therefore, if it starts below \\ — (A max — Ai) = 2Ai — A max , it can not possibly 
reach the critical level by \\. 



4 Some analysis of the strong rules 
4.1 Violation of the slope condition 

Here we demonstrate a counter-example of both the slope bound (9) and of 
the strong sequential rule (6). We believe that a counter-example for the basic 
strong rule (5) can also be constructed, but we have not yet found one. Such an 
example is somewhat more difficult to construct because it would require that 
the average slope exceed 1 from A max to A, rather than exceeding 1 for short 
stretches of A values. 

We took N = 50 and p — 30, with the entries of y and X drawn inde- 
pendently from a standard normal distribution. Then we centered y and the 
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columns of X, and standardized the columns of X. As Figure 5 shows, the 
slope of Cj (X) = xj{y - X/3(A)} is cJ(A) = -1.586 for all A e [Ai,A ], where 
Ai = 0.0244, A = 0.0259, and j = 2. Moreover, if we were to use the solution 
at Ao to eliminate predictors for the fit at Ai, then we would eliminate the 2nd 
predictor based on the bound (6). But this is clearly a problem, because the 
2nd predictor enters the model at Ai. By continuity, we can choose Ai in an 
interval around 0.0244 and Ao in an interval around 0.0259, and still break the 
strong sequential rule (6). 

4.2 A sufficient condition for the slope bound 

Tibshirani & Taylor (2010) prove a general result that can be used to give the 
following sufficient condition for the unit slope bound (9). Under this condition, 
both basic and strong sequential rules are guaranteed not to fail. 

Recall that a matrix A is diagonally dominant if \Au\ > f° r au 

Their result gives us the following: 

Theorem 1. Suppose that X is N x p, with N > p, and of full rank. If 

(X T X) _1 is diagonally dominant, (14) 

then the slope bound (9) holds at all points where Cj(X) is differentiable, for 
j = l,...p, and hence the strong rules (5), (6) never produce violations. 

Proof. Tibshirani & Taylor (2010) consider a generalized lasso problem 

argmin i||y - X/3||| + A||D/3|| 1; (15) 
P 2 

where D is a general m x p penalty matrix. In the proof of their "boundary 
lemma", Lemma 1, they show that if rank(A) = p and D(X T X) _1 D T is di- 
agonally dominant, then the dual solution u(A) corresponding to problem (15) 
satisfies 

\uj{\) - Uj(Xo)\ < |A - A | 

for any j = 1, . . . m and A, Ao- By piecewise linearity of Uj(X), this means that 
|Mj(A)| < 1 at all A except the kink points. Furthermore, when D = I, problem 
(15) is simply the lasso, and it turns out that the dual solution iij(X) is exactly 
the inner product Cj(X) = xj{y — X/3(A)}. This proves the slope bound (9) 
under the condition that (X T X) _1 is diagonally dominant. 

Finally, the kink points are countable and hence form a set of Lebesgue mea- 
sure zero. Therefore Cj(A) is differentiable almost everywhere and the integrals 
in (10) and (11) make sense. This proves the strong rules (5) and (6). □ 

We note a similarity between condition (14) and the positive cone condition 
used in Efron et al. (2004). It is not hard to see that the positive cone condition 
implies (14), and actually (14) is a much weaker condition because it doesn't 
require looking at every possible subset of columns. 
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Figure 5: Example of a violation of the slope bound (9), which breaks the strong 
sequential rule (6). The entries of y and X were generated as independent, 
standard normal random variables with N = 50 and p = 30. (Hence there is no 
underlying signal.) The lines with slopes ±A are the envelop of maximal inner 
products achieved by predictors in the model for each A. For clarity we only 
show a short stretch of the solution path. The rightmost vertical line is drawn 
at Ao, and we are considering the new value Ai < Ao, the vertical line to its 
left. The horizontal line is the bound (9) . In the top right part of the plot, the 
inner product path for the predictor j = 2 is drawn in red, and starts below the 
bound, but enters the model at Ai . The slope of the red segment between Ao and 
Ai exceeds 1 in absolute value. A dotted line with slope -1 is drawn beside the 
red segment for reference. 
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A simple model in which diagonal dominance holds is when the columns 
of X arc orthonormal, because then X T X = I. But the diagonal dominance 
condition (14) certainly holds outside of the orthogonal design case. We give 
two such examples below. 

• E qui- correlation model. Suppose that ||xj|| 2 = 1 for all j, and xjx/j = r 
for all j 7^ k. Then the inverse of X T X is 

fX T x) - 1=I .^ 1 ( H T 

^ ' l-r l-r\l + r(p-l) 

where 1 is the vector of all ones. This is diagonally dominant as along as 
r > 0. 

• Haar basis model. Suppose that the columns of X form a Haar basis, the 
simplest example being 



X = 



/ 1 
1 1 



V i i 



1 / 



(16) 



the lower triangular matrix of ones. Then (X T X) _1 is diagonally domi- 
nant. This arises, for example, in the one-dimensional fused lasso where 
we solve 

^ N N 

argmin - V(y 4 - ft) 2 +AV|ft- ft_i|. 

P 2 l= l 

If we transform this problem to the parameters ol\ = 1, on = ft — ft_i for 
i = 2, . . . N, then we get a lasso with design X as in (16). 



4.3 Connection to the irrepresentable condition 

The slope bound (9) possesses an interesting connection to a concept called the 
"irrepresentable condition". Let us write A as the set of active variables at A, 

A = {j-J j (\)^0}, 

and ||b||oo = max^ \bi\ for a vector b. Then, using the work of Efron et al. 
(2004), we can express the slope condition (9) as 

||X^X^(X^X^)- 1 sign(/3^)|| oc < 1, (17) 

where by X^ and X^ c , we really mean (X^) T and (X^c) T , and the sign is 
applied element- wise. 

On the other hand, a common condition appearing in work about model 
selection properties of lasso, in both the finite-sample and asymptotic settings, 
is the so called "irrepresentable condition" ?, which is closely related to the 
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concept of "mutual incoherence" ?. Roughly speaking, if (3q- denotes the nonzero 
coefficients in the true model, then the irrepresentable condition is that 

||X^X r (XfX r )- 1 sign(/3 r )|| 00 < 1 - e (18) 

for some < e < 1. 

The conditions (18) and (17) appear extremely similar, but a key difference 
between the two is that the former pertains to the true coefficients that gener- 
ated the data, while the latter pertains to those found by the lasso optimization 
problem. Because T is associated with the true model, we can put a probability 
distribution on it and a probability distribution on sign(/3r), and then show 
that with high probability, certain designs X are mutually incoherent (18). For 
example, Candes & Plan (2009) suppose that k is sufficiently small, T is drawn 
from the uniform distribution on fc-sized subsets of {l,...p}, and each entry of 
sign(/3-7-) is equal to +1 or —1 with probability 1/2, independent of each other. 
Under this model, they show that designs X with max.,^ |xjxfe| = 0(1/ log p) 
satisfy the irrepresentable condition with very high probability. 

Unfortunately the same types of arguments cannot be applied directly to 
(17). A distribution on T and sign(/3r) induces a different distribution on A 
and sign(/3_4), via the lasso optimization procedure. Even if the distributions of 
T and sign(/3r) are very simple, the distributions of A and sign(/3^) can become 
quite complicated. Still, it does not seem hard to believe that confidence in (18) 
translates to some amount of confidence in (17). Luckily for us, we do not need 
the slope bound (17) to hold exactly or with any specified level of probability, 
because we are using it as a computational tool and can simply revert to checking 
the KKT conditions when it fails. 

4.4 A numerical investigation of the strong sequential rule 
violations 

We generated Gaussian data with N = 100, varying values of the number of 
predictors p and pairwise correlation 0.5 between the predictors. One quarter 
of the coefficients were non-zero, with the indices of the nonzero predictors 
randomly chosen and their values equal to ±2. We fit the lasso for 80 equally 
spaced values of A from X max to 0, and recorded the number of violations of 
the strong sequential rule. Figure 6 shows the number of violations (out of p 
predictors) averaged over 100 simulations: we plot versus the percent variance 
explained instead of A, since the former is more meaningful. Since the vertical 
axis is the total number of violations, we see that violations are quite rare 
in general never averaging more than 0.3 out of p predictors. They are more 
common near the right end of the path. They also tend to occur when p is fairly 
close to N. When p^> N (p = 500 or 1000 here), there were no violations. Not 
surprisingly, then, there were no violations in the numerical examples in this 
paper since they all have p> N. 

Looking at (13), it suggests that if we take a finer grid of A values, there 
should be fewer violations of the rule. However we have not found this to be 
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Figure 6: Total number of violations (out ofp predictors) of the strong sequential 
rule, for simulated data with N = 100 and varying values of p. A sequence of 
models is fit, with decreasing values of A as we move from left to right. The 
features are uncorrelated. The results are averages over 100 simulations. 
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true numerically: the average number of violations at each grid point A stays 
about the same. 

5 Screening rules for the elastic net 

In the elastic net we solve the problem 1 

minimize l||y - X/3|| 2 + ^\ 2 \\(3\\ 2 + Will (19) 

Letting 




we can write (19) as 

minimize^ ||y* - X*/3|| 2 + Ai||/3||i. (21) 

In this form we can apply the SAFE rule (2) to obtain a rule for discarding 
predictors. Now |x* T y*| = |xjy|, ||x*|| = ^||x.,|| 2 + A 2 , ||y*|| = ||y||. Hence 
the global rule for discarding predictor j is 

|xjy| < Ax - ||y|| ■ v /||x J |P + A 2 . Al 7-' Al (22) 

Note that the glmnet package uses the parametrization ((1 — a) A, aX) rather 
than (A 2 , Ai). With this parametrization the basic SAFE rule has the form 

|xjy| < (aX - ||y|| • J\\^ + (1 - a)X ■ ^L_^) ( 23 ) 

The strong screening rules turn out to be the same as for the lasso. With 
the glmnet parametrization the global rule is simply 

|xjy| < a(2A - \ max ) (24) 

while the sequential rule is 

|xjr| < a(2A- A ). (25) 

Figure 7 show results for the elastic net with standard independent Gaussian 
data, n — 100, p = 1000, for 3 values of a. There were no violations in any of 
these figures, i.e. no predictor was discarded that had a non-zero coefficient 
at the actual solution. Again we see that the strong sequential rule performs 
extremely well, leaving only a small number of excess predictors at each stage. 

1 This differs from the original form of the "naive" clastic net in Zou & Hastie (2005) by 
the factors of 1/2, just for notational convenience. 



16 




Figure 7: Elastic net: results for different rules for three different values of the 
mixing parameter a. In the plots, we are fitting along a path of decreasing A 
values and the plots show the number of predictors left after screening at each 
stage. The proportion of variance explained by the model is shown along the 
top of the plot is shown. There were no violations of any of the rules in the 3 
scenarios. 
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6 Screening rules for logistic regression 



Here we have a binary response yt = 0,1 and we assume the logistic model 

Pr(F = l\x) = 1/(1 + cxp(-/3 - x T f3)) (26) 
Letting pi — Pr(Y = l\xi), the penalized log-likelihood is 

e(Po,0) = -Y^bJilogn + (1 - W )Iog(l - Pi )] + AH/31U (27) 

i 

El Ghaoui et al. (2010) derive an exact global rule for discarding predictors, 
based on the inner products between y and each predictor, using the same kind 
of dual argument as in the Gaussian case. 

Here we investigate the analogue of the strong rules (5) and (6). The sub- 
gradient equation for logistic regression is 

X T (y - p(/3)) = A • sign(/3) (28) 

This leads to the global rule: letting p = ly, X max — max|xj(y — p)|, we 
discard predictor j if 

|xj(y-p)| < 2A- \ max (29) 
The sequential version, starting at Ao, uses po = p(/?o(Ao),/3(Ao)): 

|xj(y-p )| <2A-A . (30) 

Figure 8 show the result of various rules in an example, the newsgroup 
document classification problem (Lang 1995). We used the training set cultured 
from these data by Koh et al. (2007). The response is binary, and indicates 
a subclass of topics; the predictors are binary, and indicate the presence of 
particular tri-gram sequences. The predictor matrix has 0.05% nonzero values. 
2 Results for are shown for the new global rule (29) and the new sequential 
rule (30). We were unable to compute the logistic regression global SAFE rule 
for this example, using our R language implementation, as this had a very long 
computation time. But in smaller examples it performed much like the global 
SAFE rule in the Gaussian case. Again we see that the strong sequential rule 
(30), after computing the inner product of the residuals with all predictors at 
each stage, allows us to discard the vast majority of the predictors before fitting. 
There were no violations of either rule in this example. 

Some approaches to penalized logistic regression such as the glmnet package 
use a weighted least squares iteration within a Newton step. For these algo- 
rithms, an alternative approach to discarding predictors would be to apply one 
of the Gaussian rules within the weighted least squares iteration. 

2 This dataset is available as a saved R data object at 
http : / /www-stat . Stanford . edu/ hastie/glmnet 
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Figure 8: Logistic regression: results for newsgroup example, using the new 
global rule (29) and the new sequential rule (30). The broken black curve is the 
45° line, drawn on the log scale. 
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Wu et al. (2009) used |xj(y - p)| to screen predictors (SNPs) in genome- 
wide association studies, where the number of variables can exceed a million. 
Since they only anticipated models with say k < 15 terms, they selected a small 
multiple, say lOfc, of SNPs and computed the lasso solution path to k terms. 
All the screened SNPs could then be checked for violations to verify that the 
solution found was global. 



7 Strong rules for general problems 

Suppose that we have a convex problem of the form 



K 

minimize^ 



/(/3) + A- (31) 



k=l 

where / and g are convex functions, / is differentiable and (3 = (fii,02, ■ ■ ■ @k) 
with each (3u being a scalar or vector. Suppose further that the subgradient 
equation for this problem has the form 

/'(£) + A ■ s fc = 0; k = l,2,...K (32) 

where each subgradient variable satisfies ||sfe|| g < A, and ||sfc|| Q = A when 
the constraint g (f3j) = is satisfied (here ||-|| 9 is a norm). Suppose that we have 
two values A < Ao, and corresponding solutions /3(A), /3(Ao). Then following the 
same logic as in Section 3, we can derive the general strong rule 

||^°*)||,<(l + A)A-A\o (33) 

This can be applied either globally or sequentially. In the lasso regression set- 
ting, it is easy to check that this reduces to the rules (5), (6) where A=l. 

The rule (33) has many potential applications. For example in the graphical 
lasso for sparse inverse covariance estimation (Friedman et al. 2007), we observe 
N multivariate normal observations of dimension p, with mean and covariance 
E, with observed empirical covariance matrix S. Letting = the problem 
is to maximize the penalized log-likelihood 

logdete-tr(Se)-A||9||i, (34) 

over non- negative definite matrices 0. The penalty ||0||i sums the absolute 
values of the entries of 0; we assume that the diagonal is not penalized. The 
subgradient equation is 

E-5-A-r = 0, (35) 

where € sign(0y). One could apply the rule (33) elementwise, and this 
would be useful for an optimization method that operates elementwise. This 
gives a rule of the form \Sij — E(Ao)| < 2A — A . However, the graphical lasso 
algorithm proceeds in a blockwise fashion, optimizing one whole row and column 
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Figure 9: Strong global and sequential rules applied to the graphical lasso. A 
broken line with unit slope is added for reference. 



at a time. Hence for the graphical lasso, it is more effective to discard entire 
rows and columns at once. For each row i, let S12, CT12, and Ti2 denote Si,—i, 
and r^-j, respectively. Then the subgradient equation for one row has 
the form 

0-12 - S12 - A • r i2 = 0, (36) 

Now given two values A < Ao, and solution S° at An, we form the sequential 
rule 

max|<7° 2 - s 12 | < 2A - A . (37) 

If this rule is satisfied, we discard the entire ith row and column of 0, and 
hence set them to zero (but retain the ith diagonal element). Figure 9 shows an 
example with iV = 100, p — 300, standard independent Gaussian variates. No 
violations of the rule occurred. 

Finally, we note that strong rules can be derived in a similar way, for other 
problems such as the group lasso (Yuan & Lin 2007). In particular, if denotes 
the n x pi block of the design matrix corresponding to the features in the £th 
group, then the strong sequential rule is simply 

||Xjr|| 2 < 2A- A max . 

When this holds, we set f3 e = 0. 
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8 Implementation and numerical studies 



The strong sequential rule (6) can be used to provide potential speed improve- 
ments in convex optimization problems. Generically, given a solution /3(Ao) and 
considering a new value A < A , let 5(A) be the indices of the predictors that 
survive the screening rule (6): we call this the strong set. Denote by E the 
eligible set of predictors. Then a useful strategy would be 

1. Set E = 5(A). 

2. Solve the problem at value A using only the predictors in E. 

3. Check the KKT conditions at this solution for all predictors. If there are 
no violations, we are done. Otherwise add the predictors that violate the 
KKT conditions to the set E, and repeat steps (b) and (c). 

Depending on how the optimization is done in step (b), this can be quite ef- 
fective. Now in the glmnet procedure, coordinate descent is used, with warm 
starts over a grid of decreasing values of A. In addition, an "ever-active" set 
of predictors A(X) is maintained, consisting of the indices of all predictors that 
have a non-zero coefficient for some A' greater than the current value A under 
consideration. The solution is first found for this active set: then the KKT 
conditions are checked for all predictors, if there are no violations, then we have 
the solution at A; otherwise we add the violators into the active set and repeat. 

The two strategies above are very similar, with one using the strong set 
5(A) and the other using the ever-active set A(X). Figure 10 shows the active 
and strong sets for an example. Although the strong rule greatly reduces the 
total number of predictors, it contains more predictors than the ever-active set; 
accordingly, violations occur more often in the ever-active set than the strong 
set. This effect is due to the high correlation between features and the fact 
that the signal variables have coefficients of the same sign. It also occurs with 
logistic regression with lower correlations, say 0.2. 

In light of this, we find that using both A(X) and 5(A) can be advantageous. 
For glmnet we adopt the following combined strategy: 

1. Set E = A(X). 

2. Solve the problem at value A using only the predictors in E. 

3. Check the KKT conditions at this solution for all predictors in 5(A). If 
there are violations, add these predictors into E, and go back to step (a) 
using the current solution as a warm start. 

4. Check the KKT conditions for all predictors. If there are no violations, 
we are done. Otherwise add these violators into A(X), recompute 5(A) 
and go back to step (a) using the current solution as a warm start. 

Note that violations in step (c) are fairly common, while those in step (d) are 
rare. Hence the fact that the size of 5(A) is <C p can make this an effective 
strategy. 
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Figure 10: Gaussian lasso setting, N = 200, p = 20,000, pairwise correlation 
between features of 0.7. The first 50 predictors have positive, decreasing co- 
efficients. Shown are the number of predictors left after applying the strong 
sequential rule (6) and the number that have ever been active (i.e. had a non- 
zero coefficient in the solution ) for values of A larger than the current value. 
A broken line with unit slope is added for reference. The right-hand plot is a 
zoomed version of the left plot. 
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We implemented this strategy and compare it to the standard glmnet algo- 
rithm in a variety of problems, shown in Tables 1-3. Details are given in the 
table captions. We see that the new strategy offers a speedup factor of five or 
more in some cases, and never seems to slow things down. 

The strong sequential rules also have the potential for space savings. With a 
large dataset, one could compute the inner products {xjr}^ offline to determine 
the strong set of predictors, and then carry out the intensive optimization steps 
in memory using just this subset of the predictors. 

9 Discussion 

In this paper we have proposed strong global and sequential rules for discarding 
predictors in statistical convex optimization problems such as the lasso. When 
combined with checks of the KKT conditions, these can offer substantial im- 
provements in speed while still yielding the exact solution. We plan to include 
these rules in a future version of the glmnet package. 

The RECSAFE method uses the solution at a given point Ao to derive a rule 
for discarding predictors at A < Ao- Here is another way to (potentially) apply 
the SAFE rule in a sequential manner. Suppose that we have 0o = /3(Ao), and 
r = y — X/3 , and we consider the fit at A < A , with r = y — X/3 - Defining 

Ao = maxjdxjrl); (38) 

we discard predictor j if 

IxJrKA-HrHM^ (39) 

We have been unable to prove the correctness of this rule, and do not know if it 
is infallible. At the same time, we have been not been able to find a numerical 
example in which it fails. 
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Table 1: Glmnet timings (seconds) for fitting a lasso problem in the Gaussian 
setting. In the first four columns, there are p = 100, 000 predictors, N = 200 
observations, 30 nonzero coefficients, with the same value and signs alternat- 
ing; signal-to-noise ratio equal to 3. In the rightmost column, the data matrix 
is sparse, consisting of just zeros and ones, with 0.1% of the values equal to 
1. There are p = 50,000 predictors, N = 500 observations, with 25% of the 
coefficients nonzero, having a Gaussian distribution; signal-to-noise ratio equal 
to 4.3. 



Method 


Population correlation 

0.0 0.25 0.5 0.75 Sparse 


glmnet 

with seq-strong 


4.07 6.13 9.50 17.70 4.14 
2.50 2.54 2.62 2.98 2.52 



Table 2: Glmnet timings (seconds) for fitting an elastic net problem. There are 
p = 100, 000 predictors, N = 200 observations, 30 nonzero coefficients, with the 
same value and signs alternating; signal-to-noise ratio equal to 3 



Method 


a 

1.0 0.5 0.2 0.1 0.01 


glmnet 

with seq-strong 


9.49 7.98 5.88 5.34 5.26 
2.64 2.65 2.73 2.99 5.44 



Table 3: Glmnet timings (seconds) fitting a lasso/logistic regression problem. 
Here the data matrix is sparse, consisting of just zeros and ones, with 0.1% of 
the values equal to 1. There are p = 50,000 predictors, N = 800 observations, 
with 30% of the coefficients nonzero, with the same value and signs alternating; 
Bayes error equal to 3%. 



Method 


Population correlation 
0.0 0.5 0.8 


glmnet 

with seq-strong 


11.71 12.41 12.69 
6.31 9.491 12.86 
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